# Open-vocabulary recognition

| Model | License | Description | Task | Author | Downloads | Likes |
|---|---|---|---|---|---|---|
| OPENCLIP SigLIP Tiny 14 Distill SigLIP 400m Cc9m | MIT | Lightweight SigLIP-based vision-language model distilled from the larger SigLIP-400m model; suited to zero-shot image classification. | Image Classification | PumeTu | 30 | 0 |
| Llmdet Swin Tiny Hf | Apache-2.0 | LLMDet is an open-vocabulary object detector supervised by a large language model, capable of zero-shot object detection. | Object Detection | fushh7 | 2,451 | 0 |
| Eva02 Large Patch14 Clip 224.merged2b | MIT | EVA-CLIP vision-language model distributed as OpenCLIP/timm weights, supporting zero-shot image classification. | Image Classification | timm | 165 | 0 |
| Eva02 Enormous Patch14 Clip 224.laion2b Plus | MIT | EVA-CLIP, a large-scale CLIP-style vision-language model, supporting zero-shot image classification. | Image Classification | timm | 54 | 0 |
| Vit Huge Patch14 Clip 224.metaclip Altogether | | CLIP model built on the ViT-Huge architecture, supporting zero-shot image classification. | Image Classification | timm | 171 | 1 |
| Resnet101 Clip.openai | MIT | CLIP model with a ResNet-101 image encoder, supporting zero-shot image classification. | Image Classification | timm | 2,717 | 0 |
| Owlv2 Large Patch14 Ensemble | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that finds objects in images from text queries. | Object Detection Transformers | Thomasboosinger | 1 | 0 |
| Owlv2 Base Patch16 | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that detects and localizes objects in images from text queries. | Object Detection Transformers | vvmnnnkv | 26 | 0 |
| Owlv2 Large Patch14 Finetuned | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that detects objects from text queries without category-specific training data. | Object Detection Transformers | google | 1,434 | 4 |
| Owlv2 Base Patch16 Finetuned | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that retrieves objects in images from text queries. | Object Detection Transformers | google | 2,698 | 3 |
| CLIP ViT L 14 CommonPool.XL.clip S13b B90k | MIT | CLIP-architecture vision-language model supporting zero-shot image classification and cross-modal retrieval. | Image Classification | laion | 534 | 1 |
| CLIP ViT B 32 CommonPool.M.clip S128m B4k | MIT | CLIP model trained on the CommonPool.M data pool, supporting zero-shot image classification. | Image Classification | laion | 164 | 0 |
| Eva02 Large Patch14 Clip 224.merged2b S4b B131k | MIT | EVA02, a large-scale CLIP-style vision-language model, supporting zero-shot image classification. | Image Classification | timm | 5,696 | 6 |
| Owlvit Base Patch32 | Apache-2.0 | OWL-ViT zero-shot, text-conditioned object detector that can search for objects in images via text queries without category-specific training data. | Object Detection Transformers | google | 764.95k | 129 |
| Clip Vit Base Patch32 | | CLIP is a multimodal model from OpenAI that relates images and text, supporting zero-shot image classification. | Image Classification | openai | 14.0M | 666 |
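Most of the CLIP, SigLIP, and EVA-CLIP checkpoints above are used the same way for zero-shot classification: encode the image and a set of candidate label prompts, then rank the labels by image-text similarity. Below is a minimal sketch using the `openai/clip-vit-base-patch32` checkpoint (the "Clip Vit Base Patch32" entry) with the Hugging Face `transformers` API; the image path and label prompts are placeholders.

```python
# Zero-shot image classification sketch with openai/clip-vit-base-patch32.
# The image path and candidate labels are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```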
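The OWL-ViT and OWLv2 entries are text-conditioned detectors rather than classifiers: given free-form text queries, they return scored bounding boxes. A minimal sketch using the `google/owlvit-base-patch32` checkpoint (the "Owlvit Base Patch32" entry) through the `transformers` zero-shot object detection pipeline; the image, queries, and score threshold are placeholders.

```python
# Zero-shot object detection sketch with google/owlvit-base-patch32.
# The image path, text queries, and threshold are illustrative placeholders.
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection",
                    model="google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
queries = ["a cat", "a remote control"]

results = detector(image, candidate_labels=queries, threshold=0.1)
for r in results:
    box = r["box"]  # dict with xmin, ymin, xmax, ymax in pixels
    print(f"{r['label']}: score={r['score']:.2f}, box={box}")
```

The OWLv2 checkpoints in the table should work the same way by swapping in their model ids, since they expose the same zero-shot detection interface.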